feat(runners): add mi325x-vultr launch script#1738

Closed
Oseltamivir wants to merge 6 commits into main from add-mi325x-vultr-runner

Oseltamivir (Collaborator) commented Jun 13, 2026

Add runners/launch_mi325x-vultr.sh for the vultr mi325x fleet. Modeled on launch_mi325x-amds.sh (same SKU, same compute partition, same single-node salloc/import/srun flow and *_mi325x.sh bench invocation), with the two cluster-specific paths:

  • enroot cache (import layer cache + imported .sqsh) at /enroot/sa
  • pre-staged model weights / HF hub cache at /nfsdata/sa/models/, bind-mounted over the container HF_HUB_CACHE so hf download "$MODEL" reuses the staged models--org--name caches instead of re-downloading from HF.

Both paths are node-local ext4 at the same path on every compute node; import and run share one Slurm job on a single node, so node-local storage suffices.
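
As a rough illustration of the two cluster-specific paths, the sketch below shows how a staged cache directory and the bind-mount spec could be derived. It is not the contents of launch_mi325x-vultr.sh: the helper names and the in-container cache path (`/root/.cache/huggingface/hub`, the default hub location for a root user) are assumptions; only `/enroot/sa`, `/nfsdata/sa/models/`, and the `models--org--name` layout come from the description above.

```shell
#!/bin/sh
# Sketch only. Host paths are taken from the PR description; the container
# path and all function names are assumptions for illustration.
ENROOT_CACHE_DIR=/enroot/sa                              # import layer cache + imported .sqsh
STAGED_MODELS_DIR=/nfsdata/sa/models                     # pre-staged HF hub cache
CONTAINER_HF_HUB_CACHE=/root/.cache/huggingface/hub      # assumed HF_HUB_CACHE in the container

# Map "org/name" to the hub cache directory "models--org--name" under the staged root.
staged_cache_dir() {
    printf '%s/models--%s\n' "$STAGED_MODELS_DIR" "$(printf '%s' "$1" | sed 's|/|--|g')"
}

# Build a "host:container" bind-mount spec that overlays the container's hub cache.
hf_cache_mount() {
    printf '%s:%s\n' "$STAGED_MODELS_DIR" "$CONTAINER_HF_HUB_CACHE"
}

# Succeeds only if the model already has a staged cache directory on this node.
is_staged() {
    [ -d "$(staged_cache_dir "$1")" ]
}
```

With the mount in place, a download inside the container should find the `models--MiniMaxAI--MiniMax-M3-MXFP8` directory already populated and skip the network fetch; `is_staged "$MODEL"` could serve as a pre-flight check before the job starts.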


Note

Low Risk
Changes are additive benchmark/CI infrastructure (configs, launcher, shell recipe) with no production auth or data-path logic; main risk is long, resource-heavy CI sweeps on new hardware.

Overview
Adds day-zero MiniMax-M3 MXFP8 single-node vLLM benchmarking on the Vultr MI325X fleet, alongside infrastructure to run it in CI.

A new mi325x-vultr runner pool (six GitHub runners) is wired to launch_mi325x-vultr.sh, which follows the existing MI325X Slurm/enroot flow but uses Vultr-specific enroot cache (/enroot/sa), staged HF hub cache bind-mount (/nfsdata/sa/models/), and Slurm node excludes for known-bad hosts.

minimaxm3-fp8-mi325x-vllm in amd-master.yaml registers MiniMaxAI/MiniMax-M3-MXFP8 on vllm/vllm-openai-rocm:minimax-m3 with fixed-seq-len sweeps (1k1k / 8k1k) over TP4/TP8, TEP (EP4/EP8), and DEP. TP2 is omitted relative to B300 because the ~444 GB MXFP8 weights would OOM on 256 GB GPUs.
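
The TP2 omission can be sanity-checked with back-of-the-envelope arithmetic. The sketch below assumes weights shard evenly across tensor-parallel ranks and that the engine may use 90% of GPU memory; both are simplifying assumptions, and the function names are hypothetical. The 444 GB and 256 GB figures come from the summary above.

```shell
#!/bin/sh
# Sketch only: even sharding and a 90% memory budget are assumptions.
WEIGHTS_GB=444      # approximate MXFP8 checkpoint size (from the summary)
GPU_MEM_GB=256      # per-GPU memory on MI325X (from the summary)
MEM_UTIL_PCT=90     # assumed share of GPU memory available to the engine

# ceil(weights / tp): per-GPU weight footprint in GB.
per_gpu_weights_gb() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

# GB left per GPU for KV cache, activations, and graphs at a given TP degree.
headroom_gb() {
    usable=$(( GPU_MEM_GB * MEM_UTIL_PCT / 100 ))
    echo $(( usable - $(per_gpu_weights_gb "$WEIGHTS_GB" "$1") ))
}

for tp in 2 4 8; do
    echo "TP$tp: $(per_gpu_weights_gb "$WEIGHTS_GB" "$tp") GB weights/GPU, $(headroom_gb "$tp") GB headroom"
done
```

Under these assumptions TP2 puts 222 GB of weights on each GPU and leaves about 8 GB of headroom, versus 119 GB at TP4 and 174 GB at TP8, which is consistent with dropping TP2 from the sweep.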

minimaxm3_fp8_mi325x.sh implements the ROCm recipe: mandatory --block-size 128, TRITON_ATTN, --language-model-only, conc-scaled CUDA graphs, extended engine ready timeout, and standard MI325X ROCm env (AITER, HIP/Ray). perf-changelog.yaml documents the new config key.

Reviewed by Cursor Bugbot for commit 0bd8981.

Add runners/launch_mi325x-vultr.sh for the vultr mi325x fleet. Modeled on
launch_mi325x-amds.sh (same SKU, same compute partition, same single-node
salloc/import/srun flow and *_mi325x.sh bench invocation), with the two
cluster-specific paths:

- enroot cache (import layer cache + imported .sqsh) at /enroot/sa
- pre-staged model weights / HF hub cache at /nfsdata/sa/models/, bind-mounted
  over the container HF_HUB_CACHE so `hf download "$MODEL"` reuses the staged
  models--org--name caches instead of re-downloading from HF.

Both paths are node-local ext4 at the same path on every compute node; import
and run share one Slurm job on a single node, so node-local storage suffices.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Comment thread runners/launch_mi325x-vultr.sh
@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 441ba6d.

image: vllm/vllm-openai-rocm:minimax-m3
model: MiniMaxAI/MiniMax-M3-MXFP8
model-prefix: minimaxm3
runner: mi325x

Wrong runner type in config

High Severity

The new Vultr MiniMax-M3 entry sets runner to mi325x, so CI schedules the AMDS fleet and launch_mi325x-amds.sh instead of mi325x-vultr and launch_mi325x-vultr.sh. Staged weights at /nfsdata/sa/models/ and enroot cache at /enroot/sa are never used for this config.
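
A small pre-merge check could catch this class of mismatch. The sketch below is hypothetical: it reads a config stanza on stdin, extracts the `runner:` value, and maps it to a launch script using only the two pairings named in this comment. How the real CI resolves runner labels to scripts is not shown in this PR, so the mapping is an assumption.

```shell
#!/bin/sh
# Sketch only: function names and the label-to-script mapping are assumptions
# based on the review comment, not the repository's actual dispatch logic.

# Print the value of the first "runner:" key found on stdin.
runner_of() {
    sed -n 's/^[[:space:]]*runner:[[:space:]]*//p' | head -n 1
}

# Map a runner label to the launch script it would select.
launch_script_for() {
    case "$1" in
        mi325x)       echo runners/launch_mi325x-amds.sh ;;
        mi325x-vultr) echo runners/launch_mi325x-vultr.sh ;;
        *)            echo "unknown runner: $1" >&2; return 1 ;;
    esac
}
```

Piping the stanza quoted above through `runner_of` yields `mi325x`, which `launch_script_for` resolves to the AMDS launcher rather than the Vultr one.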

Node chi-mi325x-pod1-027 fails SLURM resume/boot — salloc grants an
allocation then relinquishes it with "Something is wrong with the boot of
the nodes" (run 27454108525), gating the minimaxm3-fp8-mi325x canary and
thus the whole sweep. Add it to the --exclude list alongside the existing
pod1-121 exclusion until the node is repaired.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
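
For illustration, an exclude list like the one described in this commit could be assembled as follows. The helper name is hypothetical and this is not the code in the launch script; `--exclude` itself is a standard Slurm option that accepts a comma-separated node list.

```shell
#!/bin/sh
# Sketch only: exclude_flag is a hypothetical helper, not part of the PR.

# Join hostnames with commas into one --exclude=... argument.
# Prints nothing when called with no hostnames.
exclude_flag() {
    [ "$#" -gt 0 ] || return 0
    list=$1
    shift
    for node in "$@"; do
        list="$list,$node"
    done
    printf '%s%s\n' '--exclude=' "$list"
}
```

Calling `exclude_flag chi-mi325x-pod1-027` prints `--exclude=chi-mi325x-pod1-027`; the existing pod1-121 exclusion would be passed as a second argument using its full hostname, and the result handed to salloc.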